A Grammar Based Analysis of Column Header Categories for Web Tables
Authors
Abstract
As part of a project to harvest semi-structured data from web tables, we describe an approach to extract an abstract representation of the column-header categories based on a context-free grammar for linear strings. The column-header structure is generally an XY-tessellation. The grammar provides a compact representation of infinitely many structural variations possible within column headers. Before parsing, the 2D column-header structure is converted to a linear string of its atomic cell labels and delimiters for the X and Y cuts. The acceptable strings represent a superset of admissible column-header structures from which the invalid ones are eliminated by performing geometric and lexical checks on the labels of the parse tree. Experimental results on web tables show that 80% of the headers in the sample could be processed successfully using the grammatical approach.
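The linearization step described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the nested-tuple tree format, the `|` delimiter for X cuts (cells side by side), the `/` delimiter for Y cuts (cells stacked top to bottom), the parenthesized grouping, and the toy grammar `header := LABEL | "(" header (SEP header)+ ")"`. The function names `linearize` and `parse` are likewise hypothetical.

```python
import re

# Hypothetical encoding (not the paper's): "|" marks an X cut, "/" marks a Y cut.
# A header is either a label string or a tuple (cut, children) with cut in {"X", "Y"}.

def linearize(node):
    """Flatten a nested XY-cut tree into a delimited linear string."""
    if isinstance(node, str):
        return node
    cut, children = node
    sep = "|" if cut == "X" else "/"
    return "(" + sep.join(linearize(child) for child in children) + ")"

def parse(text):
    """Recursive-descent parser for the toy grammar; returns the XY-cut tree."""
    # Tokens are single delimiters or maximal runs of non-delimiter characters.
    tokens = re.findall(r"[()|/]|[^()|/]+", text)
    pos = 0

    def header():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        if tok == "(":
            pos += 1
            children = [header()]
            sep = None
            while pos < len(tokens) and tokens[pos] in ("|", "/"):
                if sep is None:
                    sep = tokens[pos]
                elif tokens[pos] != sep:
                    # One group corresponds to one kind of cut only.
                    raise ValueError("mixed X and Y cuts in one group")
                pos += 1
                children.append(header())
            if sep is None or pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("malformed group")
            pos += 1
            return ("X" if sep == "|" else "Y", children)
        if tok in (")", "|", "/"):
            raise ValueError(f"unexpected token {tok!r}")
        pos += 1
        return tok.strip()

    tree = header()
    if pos != len(tokens):
        raise ValueError("trailing input after header")
    return tree

# "Region" next to a "Year" cell that spans the columns "2010" and "2011".
example = ("X", ["Region", ("Y", ["Year", ("X", ["2010", "2011"])])])
linear = linearize(example)
recovered = parse(linear)
```

Here `linear` is the string `(Region|(Year/(2010|2011)))`, and parsing it recovers the original tree. The sketch covers only the syntactic step; the geometric and lexical checks that the abstract applies to the parse tree are not modeled.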
Similar resources
Clustering header categories extracted from web tables
Revealing related content among heterogeneous web tables is part of our long term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube vie...
Recovering Semantics of Tables on the Web
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as sear...
Development of a site-specific regression model for assessment of road-header cutting performance of Tabas coal mine based on rock properties
In underground excavation where road-headers are employed, precise prediction of road-header performance plays a vital role in the economy of the project. In this paper, a new model is developed for predicting road-header performance using non-linear multivariate regression analysis. This model is able to estimate the instantaneous cutting rate (ICR) of the road-header based on r...
A Generic Tool to Generate a Lexicon for NLP from Lexicon-Grammar Tables (author manuscript, published in "Actes du 27e Colloque International sur le Lexique et la Grammaire")
Symbolic approaches to deep parsing often require large-coverage and fine-grained lexical information, such as a syntactic lexicon. Lexicon-Grammar tables (Gross 1975, 1994), carefully developed by linguists since the 1970s, constitute such a syntactic resource. Each table represents a class of predicates sharing some syntactic features. Each row corresponds to a lexical entry (verb, predicative n...
A Tiled-Table Convention for Compressing FITS Binary Tables
This document describes a convention for compressing FITS binary tables that is modeled after the FITS tiled-image compression method (White et al. 2009) that has been in use for about a decade. The input table is first optionally subdivided into tiles, each containing an equal number of rows, then every column of data within each tile is compressed and stored as a variable-length array of byte...